XS✔ ◾ feat: HA & Raft quorum safety guardrails #2

Merged
dimoschi merged 17 commits into main from feat/ha-quorum-safety
Mar 20, 2026

@dimoschi dimoschi commented Mar 20, 2026

Summary

  • PDB enabled by default for multi-node clusters (replicaCount > 1), automatically skipped for single-replica deployments
  • Auto-calculated maxUnavailable for both PDB and StatefulSet updateStrategy based on Raft fault tolerance (floor(replicaCount/2)), with user override support and quorum validation
  • Explicit updateStrategy on StatefulSet with RollingUpdate default and OnDelete support, with schema-level enum and if/then/else conditional validation
  • Built-in soft pod anti-affinity (preferredDuringSchedulingIgnoredDuringExecution) on kubernetes.io/hostname for multi-node clusters
  • Template-level validation that rejects maxUnavailable values exceeding Raft fault tolerance and invalid updateStrategy.type values (defense in depth)
  • Remove hardcoded chart version from test suites
  • Bump chart version to 1.1.0
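
The built-in soft anti-affinity listed above would render into the pod spec roughly as follows. This is only a sketch, not the chart's actual output: the `weight` of 100 and the selector labels with their values are assumptions, and only the preferred (soft) rule type and the `kubernetes.io/hostname` topology key come from the summary.

```yaml
# Sketch of the pod-level affinity a multi-node release (replicaCount > 1) might render.
affinity:
  podAntiAffinity:
    # Soft rule: the scheduler prefers, but does not require, one pod per node.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100  # assumed value; Kubernetes accepts 1 to 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: my-chart        # hypothetical label value
              app.kubernetes.io/instance: my-release  # hypothetical label value
```

Because the rule is preferred rather than required, pods can still be scheduled on a cluster that has fewer nodes than replicas.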

Quorum safety

| replicaCount | Raft fault tolerance | Auto maxUnavailable |
|---|---|---|
| 1 | 0 (no HA) | N/A (skipped) |
| 3 | 1 | 1 |
| 5 | 2 | 2 |
| 7 | 3 | 3 |

Both pdb.maxUnavailable and updateStrategy.rollingUpdate.maxUnavailable default to "auto" (auto-calculated as floor(replicaCount/2)). Users can set an integer value: 0 to block all voluntary disruptions, or any positive value up to floor(replicaCount/2). The template fails with a descriptive error if the override exceeds the Raft fault tolerance.
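
As an illustration of the override rules above, a values file for a five-node cluster might look like the sketch below. The key names come from this description; the `replicaCount` of 5 and the commented-out alternatives are illustrative assumptions.

```yaml
# values.yaml sketch; replicaCount is illustrative.
replicaCount: 5           # floor(5/2) = 2

pdb:
  enabled: true
  maxUnavailable: auto    # resolves to 2
  # maxUnavailable: 0     # blocks all voluntary disruptions
  # maxUnavailable: 1     # accepted: 1 <= 2
  # maxUnavailable: 3     # rejected at render time: 3 > 2

updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: auto  # same calculation as the PDB
```

One caveat worth keeping in mind: floor(replicaCount/2) matches Raft fault tolerance for the odd replica counts in the table, but for an even count a Raft majority tolerates only floor((replicaCount-1)/2) failures, which is one fewer.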

Breaking changes

  • pdb.enabled default changed from false to true
  • pdb.maxUnavailable default changed from 1 to "auto" (auto-calculate as floor(replicaCount/2))
  • updateStrategy block added with type: RollingUpdate and rollingUpdate.maxUnavailable: "auto"
  • Built-in soft pod anti-affinity is now injected when affinity is unset or empty ({}) and replicaCount > 1
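
For users who want to approximate the previous defaults after upgrading, overrides along these lines should work. This is a sketch under the assumption that a non-empty `affinity` replaces the built-in rule (the description only says the rule is injected when `affinity` is unset or empty); the node affinity shown is a hypothetical example.

```yaml
# Sketch: overrides that approximate the pre-1.1.0 defaults.
pdb:
  enabled: false      # previous default
  maxUnavailable: 1   # previous default; only relevant if the PDB is enabled,
                      # and still subject to the quorum validation

# A non-empty affinity is assumed to replace the built-in soft anti-affinity.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values: [linux]
```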

Schema validation

  • updateStrategy.type enforced as enum: RollingUpdate or OnDelete
  • maxUnavailable validated as anyOf: [integer >= 0, "auto"]
  • rollingUpdate conditionally required only when type is RollingUpdate (via if/then)
  • affinity typed as object
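
The constraints above map onto standard JSON Schema keywords roughly as in the sketch below. It is written as YAML for readability, and the exact layout is an assumption: the chart's generated schema file is presumably JSON and may be organized differently.

```yaml
# JSON Schema keywords written as YAML; structure is assumed, not copied from the chart.
properties:
  affinity:
    type: object
  pdb:
    type: object
    properties:
      maxUnavailable:
        anyOf:
          - type: integer
            minimum: 0
          - const: auto
  updateStrategy:
    type: object
    properties:
      type:
        enum: [RollingUpdate, OnDelete]
      rollingUpdate:
        type: object
        properties:
          maxUnavailable:
            anyOf:
              - type: integer
                minimum: 0
              - const: auto
    # rollingUpdate is required only when type is RollingUpdate.
    if:
      properties:
        type:
          const: RollingUpdate
    then:
      required: [rollingUpdate]
```

In this form the `if` also matches when `type` is omitted, because `properties` only constrains keys that are present. That coincides with RollingUpdate being the default.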

Test plan

  • 91 helm-unittest tests pass (12 PDB + 48 StatefulSet + 31 others)
  • helm lint . passes
  • Schema and docs regenerated
  • CI passes on PR

github-actions bot commented Mar 20, 2026

PR Metrics

Thanks for keeping your pull request small.
Thanks for adding tests.

| Category | Lines |
|---|---|
| Product Code | 123 |
| Test Code | 252 |
| Subtotal | 375 |
| Ignored Code | 115 |
| Total | 490 |

@github-actions github-actions bot changed the title feat: HA & Raft quorum safety guardrails XS✔ ◾ feat: HA & Raft quorum safety guardrails Mar 20, 2026
@dimoschi dimoschi requested review from Copilot and vlasopoulos March 20, 2026 07:36

@dimoschi dimoschi requested a review from a team as a code owner March 20, 2026 08:00
…ffinity default to {}

maxUnavailable: 0 means auto-calculate as floor(replicaCount/2). This gives
proper integer typing in schema and docs. affinity: {} enables correct object
type inference by helm-docs.

…ateStrategy.type

- Change maxUnavailable default from 0 to -1 so users can explicitly set 0
  to block all voluntary disruptions
- Add template validation for updateStrategy.type (must be RollingUpdate or OnDelete)
- Fix changelog to reflect actual defaults (was incorrectly referencing null)

@vlasopoulos vlasopoulos left a comment

Just for discussion,
since we are in a flimsy dynamically typed template world that becomes strict at render (don't get me started), we could set a trend of values making sense on their own.

Comment on lines +4 to +7
{{- $maxUnavailable := int .Values.pdb.maxUnavailable -}}
{{- if lt $maxUnavailable 0 }}
{{- $maxUnavailable = $maxFault -}}
{{- end }}

Suggested change
- {{- $maxUnavailable := int .Values.pdb.maxUnavailable -}}
- {{- if lt $maxUnavailable 0 }}
- {{- $maxUnavailable = $maxFault -}}
- {{- end }}
+ {{- $maxUnavailable := .Values.pdb.maxUnavailable }}
+ {{- if and (kindIs "string" .Values.pdb.maxUnavailable) (eq .Values.pdb.maxUnavailable "auto") }}
+ {{- $maxUnavailable = $maxFault }}
+ {{- end }}

values.yaml (outdated)
Comment on lines +176 to +183
  # type: integer
- # minimum: 1
+ # minimum: -1
  # @schema
- # -- Maximum number of pods that can be unavailable during disruption
- maxUnavailable: 1
+ # -- Maximum number of pods that can be unavailable during disruption.
+ # Set to -1 (default) to auto-calculate as floor(replicaCount/2), preserving Raft quorum.
+ # Set to 0 to block all voluntary disruptions. Any positive value is used directly
+ # and must not exceed floor(replicaCount/2).
+ maxUnavailable: -1

Suggested change
- # type: integer
- # minimum: -1
- # @schema
- # -- Maximum number of pods that can be unavailable during disruption.
- # Set to -1 (default) to auto-calculate as floor(replicaCount/2), preserving Raft quorum.
- # Set to 0 to block all voluntary disruptions. Any positive value is used directly
- # and must not exceed floor(replicaCount/2).
- maxUnavailable: -1
+ # type: integer, string
+ # minimum: -1
+ # @schema
+ # -- Maximum number of pods that can be unavailable during disruption.
+ # Set to "auto" (default) to auto-calculate as floor(replicaCount/2), preserving Raft quorum.
+ # Set to 0 to block all voluntary disruptions. Any positive value is used directly
+ # and must not exceed floor(replicaCount/2).
+ maxUnavailable: auto

…else for updateStrategy

- maxUnavailable defaults to -1 (auto-calculate), 0 blocks all disruptions
- Schema enforces updateStrategy.type enum (RollingUpdate/OnDelete)
- Schema conditionally requires rollingUpdate only for RollingUpdate type
- Template validation kept as defense in depth
- Fix changelog to reflect actual defaults

dimoschi commented Mar 20, 2026

> Just for discussion, since we are in a flimsy dynamically typed template world that becomes strict at render (don't get me started), we could set a trend of values making sense on their own.

I missed your comment and went a different way. Do you think using -1 for auto makes more sense? Regardless, I went with your approach: it's not so Helm-friendly, but it looks more intuitive.

…ma validation

- maxUnavailable defaults to "auto" (auto-calculate as floor(replicaCount/2))
- Integer values are used directly, 0 blocks all voluntary disruptions
- Schema enforces anyOf [integer >= 0, "auto"] for maxUnavailable
- Schema enforces updateStrategy.type enum (RollingUpdate/OnDelete)
- Schema uses if/then to conditionally require rollingUpdate for RollingUpdate
- Update changelog to reflect "auto" defaults
@dimoschi dimoschi requested review from Copilot and vlasopoulos March 20, 2026 10:32

@vlasopoulos vlasopoulos left a comment

The quorum approves

@dimoschi dimoschi merged commit eb58901 into main Mar 20, 2026
13 checks passed
@dimoschi dimoschi deleted the feat/ha-quorum-safety branch March 20, 2026 15:36